-
Notifications
You must be signed in to change notification settings - Fork 129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Add Generic Snapshot Support Using Kubernetes VolumeSnapshots (also for complete Notebooks) #217
base: master
Are you sure you want to change the base?
Conversation
@StefanoFioravanzo @elikatsis Is it an idea to merge this pull request into a new branch so others can help with further development? |
fe22a5e
to
2c7f878
Compare
Update: Add ability to snapshot all the PVCs atached to the pod. If using |
321a058
to
89cffde
Compare
Update: I have just added the ability to restore a Notebook server from a "Notebook snapshot". In reality this is an individual snapshot of each PVC mounted to a Notebook server with some metadata written to each snapshot. You can call the function |
479e801
to
d84f419
Compare
5325649
to
361c879
Compare
Update: This is a first step to allow Kale to be storage provider agnostic when it comes to taking snapshots, as the other functions I wrote should universal. For this functionality to work the following ClusterRole and ClusterRoleBinding are needed to be able to detect the VolumeSnapshotClasses and Storage Classes. As before, this might not be secure in every environment so proceed with caution.
|
a256a8e
to
3379418
Compare
927bf87
to
79c7c10
Compare
79c7c10
to
c354d25
Compare
@davidspek do I understand correctly that by default Kale can stores snapshots only in Rok storage from Arrikto project?
I guess I must have this Rok storage as additional component. If MiniKf is free to check Kale locally then maybe should I switch it at the begging ? And how is going with testing your code from this PR? I just see here the WIP status, so I think it's still in developing, right ? |
I have separated Ceph cluster with CephFS and it already connected with K8s with using CSI. So, maybe I can check it with RWX volume from CephFS storage class. |
@mr-yaky Kale at the present time only supports Rok. MiniKF includes Rok so you can use that to test with all features enabled. As you can see I haven't looked at this code in a month now, as I was busy with a KFP PR to automatically set the namespace in notebooks (which has now been merged) along with other work. The next step, once the above error is figured out (and probably a few more), is too look at how to properly implement the metadata. At the moment the snapshot name is hardcoded, thus creating a second snapshot from a notebook will cause an error. This could be solved by adding a UUID, however I was thinking a meaningful UUID such as the snapshot content might be more useful for backtracking a snapshots source later on. Similarly, while it is already possible to restore a notebook from a snapshot of a pvc that was mounted to the notebook (even if it had multiple PVCs) by creating new PVC(s) from the snapshot, there should be a closer look at what all the relevant metadata is that should be used for this. I've been in contact with @StefanoFioravanzo about this for a while, and I hope to have some feedback on the code and the implementation soon to be able to look at this further. If you have the time and energy to test out this PR and provide feedback and suggestions it is very welcome as this is my first ever attempt at programming. If you have deployed the |
c354d25
to
867de68
Compare
Got it, then I'll try to test first with MiniKF and I'll be able to test this PR on next integration step for Kubeflow. |
return snapshot_names | ||
|
||
|
||
def check_snapshot_status(snapshot_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what is causing erros in Kale after the snapshot has been created.
867de68
to
0d2aeaf
Compare
Functions are not yet optimized, but serve as working examples.
… pvc from snapshot Creating a PVC from a snapshot is now following the same coding structure as `rokutils.py` and checks if a snapshot is ready before trying to create a PVC from it.
Add ability to snapshot all the PVCs atached to the pod. If using `snapshot_notebook` snapshot all PVCs connected to the notebook and add some notebook metadata as labels (pod name, container name, if the volume was the workspace directory) and annotations (mount path and docker image of the container) to the snapshot. A `list_pvc_snapshots` function is added that can be used to list snapshots filter using a label selector. For example: `list_pvc_snapshots(label_selector="pod=dev-0, is_workspace_dir=True")`. Eventually, the plan is to be able to start a new notebook server from a (set of) snapshot(s) using the metadata, similar to how Rok functions.
If a snapshot is made of a notebook using "snapshot_notebook", a snapshot of each of the mounted PVCs is made and some metadata of the notebook is written to the snapshots. The "restore_notebook" function can be called with the name of a PVC snapshot that was part of the notebook as an argument(only 1 snapshot name is needed, "get_nb_pvcs_from_snapshot" will find the other mounted volumes). The function will then create PVCs from the snapshot(s) needed for the Notebook server and launch a new Notebook server with those PVCs mounted.
Rather than hardcoding the supported storage classes and provisioners in the `_list_volumes` function it will now check if the provisioner of the PVCs has an accompanying VolumeSnapshotClass.
…nt the generic nature of the functions within
I'm not sure if this is properly implemented for all scenarios.
If taking snapshots is possible with the current CSI driver, selecting "Use this notebook's volumes" or "Clone Notebook Volume" will create a snapshot of the notebook'ss PVC (multiple PVCs not tested but should work as well). At this point there is still an issue with getting the snapshot status when performing this procedure.
a83c38e
to
6610bed
Compare
This pull request aims to add support for taking snapshots of notebook persistent volumes using Rook CephFS as Storage Class, similar to the functionality Kale already has with Rok. Due to the fact that the standard Kubernetes
VolumeSnapshot
API is used, this pull request could be extended to all CSI drivers that support snapshot functionality. Do note that Kale relies onReadWriteMany
support to be able to execute pipeline steps in parallel. There is a method to useReadWriteOnce
volmues; however, this requires manual editing of the created pipeline before uploading.Useful information:
The API Objects
VolumeSnapshot
,VolumeSnapshotContent
andVolumeSnapshotClass
are CRDs and not part of the Core API. As such the CustomObjectsAPI is used to create CephFS snapshots: https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CustomObjectsApi.md#delete_namespaced_custom_objectTo create a PVC from a CephFS snapshot the Core API is used.
To allow a notebook to create Rook CephFS snapshot a ClusterRole and RoleBinding for the default-editor service account in the given namespace must be applied. Below are examples for use with the namespace "admin". Any tips, comments or recommendations to improve the security of the ClusterRole and RoleBinding and how it can best be deployed are greatly appreciated.
!!!WARNING!!!
Might not be secure, only use for testing!